Hello, everyone, welcome back to the Heterogeneous Parallel Programming class. This is Lecture 1.4, Introduction to CUDA, and we're going to be talking about data parallelism and threads.

The objective of this lecture is to help you learn about data parallelism and the basic features of CUDA C, which is a heterogeneous parallel programming interface that enables the exploitation of data parallelism using both CPUs and GPUs. The topics that we're going to cover today are the hierarchical thread organization, the main interfaces for launching parallel execution, and the thread index to data index mapping.

The phenomenon of data parallelism is that different parts of the data can be processed independently of each other. A very simple example is vector addition. When we add two vectors together, the elements can be added independently: A[0] and B[0] can be added to form C[0], and A[1] and B[1] can be added to form C[1], independently of each other. So, if we have a large number of elements in each vector and a large amount of hardware, we should be able to perform all these additions in parallel. That is how we can fundamentally achieve high performance in a CUDA program, and that's why we're going to use this very simple example to illustrate the basic concepts of CUDA.

The parallel execution model of CUDA, and of its close relative OpenCL, is based on a host plus device arrangement. The basic concept is that when we start executing an application, the application will be executing on the host. The host is typically a CPU core. When execution reaches a parallel part of the application, that's when we have an opportunity to use a throughput-oriented device. This is done by writing specialized functions called kernel functions. These kernel functions are very similar to functions in the C programming language; they also take parameters or arguments. That's why I showed that a kernel function called KernelA will take arguments just like C functions.
However, kernel functions also take configuration parameters, and these configuration parameters are specially noted by three less-than signs in front and three greater-than signs in back. In between, we give the configuration parameters: the number of thread blocks in the grid and the number of threads in a thread block. In the example on the right-hand side, we show that the kernel will be executed by a number of thread blocks, shown as rectangular blocks, each with several threads in them. Within each thread block, we will have a number of threads, all executing in parallel.

After the execution of the kernel function, we return to the sequential part of the application, so we return to the host for sequential execution. Then, when we reach another part of the application where we have an opportunity for parallelism, we write another kernel, KernelB, that can be executed in parallel. So execution goes back and forth between the host and the device; "device" in CUDA means the parallel execution device, and most of the time the device corresponds to a throughput-oriented GPU.

This picture shows the levels of abstraction in a computer system. Typically, an application solves a problem at the human level, so we have a natural language description of the problem that the application should solve. Based on that description, we define algorithms, which have well-defined steps of computation and a well-defined criterion for terminating the computation. The algorithms are then implemented in programming languages such as C, C++, and so on. CUDA is really a programming language at the C level, and it's actually designed as an extension to the C language. More recently, more and more of the C++ features have also become available in CUDA, so CUDA is becoming more and more of a C++ programming language extension.
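To make the launch configuration concrete, here is a minimal sketch of the host and device flow just described. The kernel name KernelA comes from the slide, but the argument list, the array size, and the launch dimensions below are illustrative assumptions, not part of the lecture.

    // Minimal sketch only: KernelA's arguments and the launch sizes are assumptions.
    #include <cuda_runtime.h>

    __global__ void KernelA(float *data, int n) {
        // device code; one instance of this function runs for every thread in the grid
    }

    int main(void) {
        int n = 1024;
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));   // device memory for the kernel to work on

        // ... sequential host code ...

        // Configuration parameters between <<< and >>>:
        //   first value  = number of thread blocks in the grid
        //   second value = number of threads in each thread block
        KernelA<<<4, 256>>>(d_data, n);                    // 4 blocks of 256 threads each

        cudaDeviceSynchronize();                           // wait for the device before continuing on the host

        // ... more sequential host code, possibly followed by another kernel such as KernelB ...
        cudaFree(d_data);
        return 0;
    }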
However, in order to fully understand the behavior of CUDA programs, we often have to go a little bit lower in the abstraction layers and look at the instruction set architecture and the microarchitecture, which is the organization of the hardware that executes programs. So we're going to go down a little bit in today's lecture, just so that you can have a solid understanding of the execution model of CUDA. This is based on a picture from Patt and Patel, Introduction to Computing Systems: From Bits and Gates to C and Beyond.

Let's go a little bit into the concept of ISA, or Instruction Set Architecture. The Instruction Set Architecture is a contract between hardware and software. Mostly, the Instruction Set Architecture specifies a set of instructions that the hardware can execute; as long as the software consists of these instructions, the hardware knows what to do for the software. Whenever we write a piece of code in CUDA, it will eventually be compiled down to the Instruction Set Architecture level, and the program is said to be at the instruction set level once it has been compiled down to these instructions. At this level, a program is really a set of instructions stored in memory that can be read, interpreted, and executed by the hardware. The program instructions then operate on data that are stored in memory or provided by input/output devices.

The next slide shows a very simplified diagram of how the hardware is typically organized to execute programs represented at the Instruction Set Architecture level. This diagram is based on the von Neumann processor model, which was proposed by John von Neumann in the 1940s. Today, virtually all processor cores are designed based on this model or variations of it.

Let's start at the bottom. The bottom shows the Control Unit, which contains the program counter and the instruction register. The program counter specifies the location in memory where the hardware can find the next instruction that should be executed for the application.
There is a dashed line going from the Control Unit to the memory; this is the communication path the Control Unit uses to deliver the PC value to the memory and ask the memory to return the instruction bits. Once the instruction bits return from the memory, they are placed into the instruction register, or IR. That's where the hardware examines the instruction bits and determines all the activities that need to happen in order to execute that instruction. These activities are coordinated by the control signals, which are represented by the dashed line going from the Control Unit to the Processing Unit in the middle of the picture. The control signals define the activities that the ALU, the register file, and the other components in the Processing Unit need to take in every clock cycle in order to execute the instruction.

During execution, some instructions will need to access data, that is, read or write data from or to memory. That is indicated by the upward and downward arrows between the Processing Unit and the memory. So, depending on the type of instruction, the execution will involve activity in the ALUs, the register files, memory accesses, and so on. Finally, some of the data will be moved back and forth between the memory and the I/O. The I/O represents the network, the disks, the displays, and so on, and data will be moved back and forth between memory and I/O. We're actually going to see a little bit of I/O activity as well during the rest of the class.

So now we are ready to talk about the specifics of a CUDA thread. A CUDA thread is really a virtualized, or abstracted, von Neumann processor. You can think of every CUDA thread as one of these processors, and each of these processors will be able to execute a program; the kernel function that we described is that program. The hardware effectively provides a large number of these von Neumann processors.
Each of these processors will be executing that function, the kernel function. But they are virtualized in the sense that, if you look at the hardware, the number of real processors may be much, much smaller than the number of threads that a CUDA program will create. So many of these threads will need to be executed by the real processors in turn: some of them will be actively executing and some of them will not, and this is what we call context switching. We'll elaborate on that in one of the future lectures.

Let's now look at the way a CUDA programmer thinks about threads. Whenever a CUDA kernel is executed, it is executed by a grid, or array, of threads. Here we show a one-dimensional thread block, and let's assume for the moment that the grid has only one thread block. All the threads run the same code, as we described before, but every thread has a different thread index value that it will use to compute memory addresses and make control decisions.

In this particular example, we show that there are 256 threads in the thread block, and each of them has a unique thread index from 0 to 255. There's a piece of code in the kernel, shown in the box underneath, that first calculates an i variable based on the thread index. This i variable is private to every thread; that is, every von Neumann processor that corresponds to one of those threads will have its own i variable. Thread 0 will have its own i, thread 1 will have its own i, and so on.

Thread 0 will calculate its i value as 0, because the threadIdx.x value for thread 0 is 0. Thread 2 will have an i value of 2, and so on, and thread 255 will have an i value of 255. So, when we execute the statement C[i] = A[i] + B[i], the i value for every thread will be different. Thread 0 will be adding A[0] plus B[0] and assigning that to C[0],
and thread 255 will be adding A[255] plus B[255] and assigning that to C[255].

Now that we understand how a single thread block works, we can expand to multiple thread blocks. Here we show a grid of threads that are organized into N thread blocks, and each thread block still consists of 256 threads. Now every thread has not only a thread index but also a block index. The block index variable is called blockIdx.x and the thread index variable is called threadIdx.x. These are predefined CUDA variables that we can use in a kernel, and they are initialized by the hardware for each thread; you don't need to initialize them, because the system does it for every thread.

When we form the data index, we now need to factor in both the thread index and the block index. To calculate i, we take the block index, multiply it by the block dimension (in this case, 256), and then add the thread index. Thread 0 in block 0 still has an i value of 0, because the block index value is 0 in this case. But if we look at thread block 1, all its threads see a blockIdx.x value of 1, so blockIdx.x times the block dimension is 256, and thread 0 in block 1 will have an i value of 256 instead of 0. Then, obviously, thread 0 in the next thread block, thread block 2, will have an i value of 512.

As a result, the i values of the first thread block range from 0 to 255, the i values of thread block 1 range from 256 to 511, and the i values of the next block range from 512 to 767. As you can see, all the threads together form a uniform coverage of the array elements: 0 to 255 in the first thread block, 256 to 511 in the next thread block, 512 to 767 in the block after that, and so on. This kind of coverage is what we call a linear coverage of a one-dimensional array. By using this formula, we can make sure that every element of A, B, and C is covered by one of the threads. In reality, we may have each thread cover more than one element, and we will come back to this point later.
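To tie the thread index to data index mapping together, here is a minimal sketch of a vector-addition kernel of the kind the lecture describes. The function name vecAddKernel, the bounds check, and the host-side launch below are my own illustrative assumptions; the index formula and the C[i] = A[i] + B[i] statement follow the slides.

    // Sketch of the kernel discussed above; the name and the bounds check are assumptions.
    __global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // block index * block dimension + thread index
        if (i < n) {                                     // guard in case n is not a multiple of the block size
            C[i] = A[i] + B[i];
        }
    }

    // A possible launch for n elements with 256 threads per block:
    //   int threadsPerBlock = 256;
    //   int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    //   vecAddKernel<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, n);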
At this point, it's sufficient to understand how we can map every thread to a unique array element.

The threads within a thread block can cooperate through shared memory, atomic operations, and barrier synchronization. For now, it's sufficient to be familiar with these three terms, because we are going to go into much more detail later. Essentially, shared memory allows the threads to exchange data, atomic operations allow the threads to coordinate their updates to the same variables, and barrier synchronization allows threads to force the others to wait for them. All of these mechanisms allow coordination of activities across different threads. However, threads in different blocks do not interact: threads 0 through 255 in thread block 0 cannot interact with threads 0 through 255 in thread block 1. This is going to be important for understanding scalability in your CUDA code.

Now, the thread index and block index are not just one-dimensional indices. In CUDA, each block index can be a 1D, 2D, or 3D variable, and each thread index can also be 1D, 2D, or 3D. That's why, on the previous slide, when we talked about the block index, I actually wrote blockIdx.x: we were only using the first dimension of the block index variable. In reality, many applications operate on two-dimensional data, such as images, or three-dimensional data, such as the volumes in a differential equation solver for computational fluid dynamics. That's why it's very convenient to be able to use 1D, 2D, or 3D block indices and thread indices: we can map them directly onto two-dimensional or three-dimensional data and keep the program easy to read.
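As a hedged illustration of the multidimensional case, here is a small sketch of a kernel that works on a two-dimensional image using a 2D grid of 2D thread blocks. The kernel name, the image layout, and the launch shape are my own assumptions and are not taken from the lecture slides.

    // Illustrative sketch only: names and sizes are assumptions.
    __global__ void scaleImageKernel(float *img, int width, int height, float s) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;   // x dimension maps to columns
        int row = blockIdx.y * blockDim.y + threadIdx.y;   // y dimension maps to rows
        if (col < width && row < height) {                 // skip threads that fall outside the image
            img[row * width + col] *= s;                   // each thread handles one pixel
        }
    }

    // A possible launch with 16 x 16 = 256 threads per block:
    //   dim3 block(16, 16);
    //   dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    //   scaleImageKernel<<<grid, block>>>(d_img, width, height, 2.0f);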
So here I'm showing a two-dimensional block structure and, within each block, a three-dimensional thread structure. When we look at the grid, we see that each block has two indices, the X and Y indices; in the block labels on the slide, the Y index is written first and the X index second. At the top, we have block (0, 0) and block (0, 1), so Y is equal to 0 for both blocks and X is equal to 0 and 1. The second row has a Y value of 1, and X again varies from 0 to 1. Obviously, X and Y could vary from 0 to a very large number, usually in the tens or even in the hundreds; each grid dimension in CUDA can grow up to 2 to the 16th.

Now, when we look at the threads in a block, I'm showing a three-dimensional thread organization within the block. I'm expanding block (1, 1) here, and in this toy example there are 16 threads, each with a unique three-dimensional ID. The X ID varies from 0 to 3, the Y ID ranges from 0 to 1, and the Z ID ranges from 0 to 1, which gives us 16 possibilities.

As you can see, we can combine a two-dimensional grid organization with a three-dimensional block organization, or a three-dimensional grid with a two-dimensional block, and so on. This all depends on the needs of your application. As we mentioned, this kind of multidimensional index is very convenient when we need to do image processing, or solve three-dimensional partial differential equations, and so on.

So this concludes the first episode of the introduction to CUDA. If you'd like to learn more about the basic concepts of data parallelism and the basic execution model of CUDA, I would encourage you to read Chapter Three of the textbook. Thank you.